# README.txt ===============================================================================================================

This is the README file for ARCHIVE: a database of tabulated results from American ranked-choice voting elections.

  - Version: 0.2.0
  - Last Update: February 10, 2026
  - Authors: Yuki Atsusaka (atsusaka@uh.edu) [https://atsusaka.org/]
             Jordan Holbrook (jcholbrook@uh.edu) [https://jordanholbrook.github.io/]

  - Data descriptor: Please refer to our corresponding paper at {Scientific Data URL} [Preprint: https://osf.io/preprints/socarxiv/27hgx_v1]
  - Code availability: All code to reproduce our database is available at https://github.com/jordanholbrook/archive.
  - Dataverse DOI: https://doi.org/10.7910/DVN/RBWU92
  - Contact: For any issues, please contact the authors or open an issue in our GitHub repository (https://github.com/jordanholbrook/archive/issues).


# File descriptions --------------------------------------------------------------------------------------------------------

Our database offers two primary files:

  (1) "archive.csv"
  (2) "archive-reporting.csv"

These files can also be downloaded in the .dta and .RData formats at the Harvard Dataverse.

Descriptions:
(1): This is a master file in which each row corresponds to a unique combination of candidate and round in a given election. In addition to vote counts and related information at the candidate-round level, this file also contains election metadata, candidate attributes, and round-level variables.

(2): This is a comprehensive dataset of tabulated results, where each row corresponds to a unique candidate within a given election and each column shows a round-level vote count. NA indicates either that the candidate had been eliminated by that round or that the tabulation had already ended.
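
The relationship between the two layouts can be sketched with a small made-up example. The column names election_id, name, round, and votes, and the votes_r prefix, follow the examples later in this README. The numbers are invented for illustration, and we do not claim that the released reporting file was built this way.

```r
library(tidyverse)

# Toy candidate-round data in the layout of (1); numbers are made up
toy_master <- tibble(
  election_id = "toy-001",
  name        = c("A", "B", "C", "A", "B"),
  round       = c(1, 1, 1, 2, 2),
  votes       = c(40, 35, 25, 55, 45)
)

# Reshape to the layout of (2): one row per candidate, one column per round.
# Candidate C has no round-2 row, so votes_r2 becomes NA for C.
toy_reporting <- toy_master %>%
  pivot_wider(id_cols      = c(election_id, name),
              names_from   = round,
              values_from  = votes,
              names_prefix = "votes_r")

toy_reporting
```

Here the NA in votes_r2 for candidate C corresponds to the "eliminated" case described in (2).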


Additionally, we offer three files, each containing variables at a single level:

  (3): "archive-election.csv" 
  (4): "archive-round.csv"
  (5): "archive-candidate.csv"

Descriptions:
(3): This is a file that contains election-level information.
(4): This is a file that stores round-level attributes, within each election. 
(5): This is a file that stores candidate-specific variables, within each round (within each election).

The three files are hierarchically organized: a single election in (3) may have multiple rounds in (4), each of which in turn may have multiple candidates in (5). The master file (1) can be generated by merging (5) with (3) via "election_id" and then with (4) via "election_id" and "round". In R, for example, one may merge as follows:

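  library(tidyverse)

  # Read the three level-specific files
  election <- read_csv("dataverse/archive-election.csv")
  round <- read_csv("dataverse/archive-round.csv")
  candidate <- read_csv("dataverse/archive-candidate.csv")

  # Start from the candidate level, then attach election-level variables
  # (by election) and round-level variables (by election and round);
  # source_key and source_file are dropped from the joined tables first
  master <- candidate %>%
    left_join(election %>% dplyr::select(-c(source_key, source_file)),
              by = "election_id") %>%
    left_join(round %>% dplyr::select(-c(source_key, source_file)),
              by = c("election_id", "round"))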
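
After a merge like this, users may want to check that no candidate-round rows were duplicated or left unmatched. The following is a minimal sketch on toy tables. The object names ending in _toy, the n_active column, and all values are hypothetical. Only election_id, round, name, votes, and year are names used elsewhere in this README.

```r
library(tidyverse)

# Toy stand-ins for the three level-specific files (made-up values)
election_toy <- tibble(election_id = c("e1", "e2"),
                       year        = c(2020, 2022))
round_toy <- tibble(election_id = c("e1", "e1", "e2"),
                    round       = c(1, 2, 1),
                    n_active    = c(3, 2, 2))   # hypothetical round-level column
candidate_toy <- tibble(election_id = c("e1", "e1", "e1", "e1", "e1", "e2", "e2"),
                        round       = c(1, 1, 1, 2, 2, 1, 1),
                        name        = c("A", "B", "C", "A", "B", "D", "E"),
                        votes       = c(40, 35, 25, 55, 45, 60, 40))

master_toy <- candidate_toy %>%
  left_join(election_toy, by = "election_id") %>%
  left_join(round_toy, by = c("election_id", "round"))

# With unique keys in the election and round tables, the left joins
# should neither add nor drop candidate-round rows
stopifnot(nrow(master_toy) == nrow(candidate_toy))

# Candidate rows without a matching election or round (expected: zero rows)
unmatched_election <- anti_join(candidate_toy, election_toy, by = "election_id")
unmatched_round    <- anti_join(candidate_toy, round_toy, by = c("election_id", "round"))
```

If either anti_join() returns rows on real data, the corresponding identifiers are worth inspecting before analysis.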


# Code examples ------------------------------------------------------------------------------------------------------------

Example 1: Load the data

Here, we read our master dataset and display its first few observations. In R, users can load the tidyverse package and read the data into an object named archive. Calling head() then shows the first few observations.

library(tidyverse)

archive <- read_csv("archive.csv")
head(archive) # Show the first few rows  



Example 2: Targeted reporting

Users can also load our database in what we call the reporting style, where each row represents a distinct candidate, each column represents a unique round, and each cell contains a candidate-round-specific vote count. Users can then select any RCV contest by specifying conditions manually. For example, the following code extracts tabulated results from the 2022 U.S. House and Senate contests in Alaska via filter(), while keeping only the necessary columns via select(). The output shows that in Round 3, most votes from the eliminated candidate (Begich) were transferred to a non-winner (Palin) in the U.S. House race, whereas most votes from the eliminated candidate (Chesbro) were transferred to the winner (Murkowski) in the U.S. Senate race. This contrast illustrates the diversity of vote transfers under RCV.

reporting <- read_csv("archive-reporting.csv")

reporting %>%
  filter(juris == "Alaska", 
         level == "Federal", 
         year == 2022) %>%
  select(name, office, votes_r1:votes_r3)
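
The transfer patterns described above can be read off by differencing adjacent round columns. The sketch below uses a toy reporting-style table with invented numbers (not actual results). The votes_r1 and votes_r2 column names follow the example above, while gain_r2 and exhausted_r2 are names introduced here for illustration.

```r
library(tidyverse)

# Toy reporting-style table (made-up numbers, not real results)
toy <- tibble(
  name     = c("A", "B", "C"),
  votes_r1 = c(45, 35, 20),
  votes_r2 = c(50, 47, NA)   # C is eliminated after round 1
)

# Votes each surviving candidate gained between rounds 1 and 2;
# the eliminated candidate gets NA
toy_transfers <- toy %>%
  mutate(gain_r2 = votes_r2 - votes_r1)

# Ballots that counted in round 1 but for no candidate in round 2
exhausted_r2 <- sum(toy$votes_r1) - sum(toy$votes_r2, na.rm = TRUE)

toy_transfers
exhausted_r2
```

In this toy case, candidate B receives more of C's votes than candidate A does, and the remaining ballots drop out of the count.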



Example 3: Effective number of candidates

Another use of ARCHIVE is that it allows users to perform comparative analyses of RCV elections. To illustrate, we examine the number of meaningful candidates in American RCV elections. While this is one of the central problems in political science in general, it may be of particular interest to RCV scholars since RCV is often expected to provide more meaningful choices than (single-member districts with) the plurality rule by breaking the so-called lesser-of-two-evils mentality.

To count meaningful candidates, researchers may turn to the concept of the effective number of candidates (ENC). ENC is a vote share-weighted number of candidates, which accounts for how "meaningful" each candidate is based on their voter support. For example, even when there are five candidates, ENC may be around two when most are fringe candidates who receive less than 1% of the total ballots. In contrast, when only two candidates appear and they receive identical vote shares, ENC is exactly two, implying that there are two meaningful (that is, competitive) choices. Scholars may apply this well-established measure to each round of RCV contests using our database.

One notable feature of ARCHIVE is that it provides round-by-round vote counts for all candidates. This allows users to compute ENC for each round in each election. Let $v_{j}$ be candidate $j$'s vote share in a given round, where $j=1,...,J$ (with $J$ candidates) in each election. ENC is defined as $\frac{1}{\sum_{j=1}^{J}v_{j}^2}$: we square the candidate vote shares, sum them, and take the reciprocal. In R, users may compute the round-by-round ENC with the following commands.


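# Compute the effective number of candidates (ENC) per round
library(tidyverse)

dt <- read_csv("archive.csv")

dt_enc <- dt %>%
  group_by(election_id, round) %>%
  mutate(round_sum = sum(votes),          # total votes counted in the round
         round_vs = votes / round_sum,    # each candidate's vote share
         enc = 1 / sum(round_vs^2)) %>%   # reciprocal of the summed squared shares
  ungroup() %>%
  distinct(election_id, round, enc, n_cands, n_rounds)  # one row per election-round

head(dt_enc)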
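
As a quick sanity check of the ENC definition, the two scenarios described above can be reproduced on made-up vote counts. The helper enc() below is a hypothetical function written for this illustration and is not part of the database code.

```r
# Effective number of candidates from a vector of vote counts
# (hypothetical helper for illustration)
enc <- function(votes) {
  shares <- votes / sum(votes)   # convert counts to vote shares
  1 / sum(shares^2)              # reciprocal of the summed squared shares
}

# Two candidates with identical vote shares: ENC equals 2
enc_two_equal <- enc(c(50, 50))

# Five candidates, three of them fringe (1% or less each): ENC is about 2.08
enc_five_fringe <- enc(c(490, 490, 10, 5, 5))

enc_two_equal
enc_five_fringe
```

The second case shows how fringe candidates barely move the measure: three additional names on the ballot add less than 0.1 to ENC.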


# Project description --------------------------------------------------------------------------------------------------------

In the past two decades, a growing number of ranked-choice voting (RCV) elections have been conducted in various jurisdictions across the United States. However, tabulated results of RCV have been reported and stored in widely different styles across places and years, making it infeasible for researchers to perform systematic analyses of vote tabulations. We introduce ARCHIVE: a database of standardized tabulated results containing over 7,600 round-level candidate vote counts from 514 American RCV elections held between 2004 and 2024. To construct the database, we develop a methodological procedure based on large language models that semi-automatically collects, standardizes, and stores candidate vote counts while instantly validating the resulting information. Our database releases multiple levels of data, including election metadata, round-level attributes, and candidate-level information with consistent election identifiers, allowing users to address key questions about electoral competition under RCV. To illustrate, we show how users may estimate the effective number of candidates per round.
